MKTG 585R Quantitative Marketing Pre-PhD Seminar
We import, wrangle, visualize, and model data using functions. Functions are composed of arguments that tell the function how to operate.
The tidyverse is a collection of R packages that share common philosophies and are designed to work together. An R package is a collection of functions, documentation, and sometimes data.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Create a new project using Version Control > Git.
Using a .Rproj file helps us keep track of the working directory as well as safely deal with file paths. File paths can be tricky because they are operating system and user-specific. We can install the here package and use here() in conjunction with an RStudio project to write file paths for us.
Whenever we need a file path, we can use here() and build up from the working directory. For example, if customer_data.csv is in the Data folder in your working directory:
Rows: 10531 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): gender, married, college_degree, region, state, review_time, review...
dbl (5): customer_id, birth_year, income, credit, star_rating
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
customer_data now appears in our Environment.<-.# A tibble: 10,531 × 13
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1001 1971 Female 73000 742. No No South
2 1002 1970 Female 31000 749. Yes No West
3 1003 1988 Male 35000 542. No No South
4 1004 1984 Other 64000 574. Yes Yes Midwest
5 1005 1987 Male 58000 644. No Yes West
6 1006 1994 Male 164000 554. Yes Yes Midwest
7 1007 1968 Male 39000 608. No No Midwest
8 1008 1994 Male 69000 710. No No South
9 1009 1958 Male 233000 702. No No West
10 1010 1994 Female 77000 605. Yes No Northeast
# ℹ 10,521 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>
Rows: 10,531
Columns: 13
$ customer_id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1…
$ birth_year <dbl> 1971, 1970, 1988, 1984, 1987, 1994, 1968, 1994, 1958, 1…
$ gender <chr> "Female", "Female", "Male", "Other", "Male", "Male", "M…
$ income <dbl> 73000, 31000, 35000, 64000, 58000, 164000, 39000, 69000…
$ credit <dbl> 742.0827, 749.3514, 542.2399, 573.9358, 644.2439, 553.6…
$ married <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No"…
$ college_degree <chr> "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No"…
$ region <chr> "South", "West", "South", "Midwest", "West", "Midwest",…
$ state <chr> "DC", "WA", "AR", "MN", "HI", "MN", "MN", "KY", "NM", "…
$ star_rating <dbl> 4, NA, NA, NA, 5, 2, NA, 5, NA, 5, NA, 5, NA, NA, NA, N…
$ review_time <chr> "06 11, 2015", NA, NA, NA, "03 25, 2008", "06 7, 2013",…
$ review_title <chr> "Four Stars", NA, NA, NA, "Great Product!!", "Not at al…
$ review_text <chr> "everything's fine", NA, NA, NA, "I looked all over the…
We often want to filter our data to keep certain observations using column values.
# A tibble: 7,782 × 13
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1004 1984 Other 64000 574. Yes Yes Midwest
2 1005 1987 Male 58000 644. No Yes West
3 1006 1994 Male 164000 554. Yes Yes Midwest
4 1013 1974 Female 197000 680. Yes Yes West
5 1015 1995 Male 347000 658. No Yes West
6 1016 1984 Other 292000 626. Yes Yes Northeast
7 1017 1997 Male 68000 588. No Yes Midwest
8 1018 1984 Male 184000 666. No Yes Northeast
9 1019 1984 Female 60000 758. No Yes West
10 1021 1997 Female 20000 503. No Yes West
# ℹ 7,772 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>
region != "West".gender == "Female", income > 70000.Sometimes we want to slice our data to keep certain observations using their position in the data.
# A tibble: 5 × 13
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1001 1971 Female 73000 742. No No South
2 1002 1970 Female 31000 749. Yes No West
3 1003 1988 Male 35000 542. No No South
4 1004 1984 Other 64000 574. Yes Yes Midwest
5 1005 1987 Male 58000 644. No Yes West
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>
We can arrange observations to reveal helpful information and check data. By default this is ascending order, bu we can use the helper function desc() to modify the output.
# A tibble: 10,531 × 13
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 9585 1939 Female 182000 850 Yes No Northeast
2 2409 1947 Male 227000 807. Yes Yes Midwest
3 4777 1947 Male 169000 651. Yes Yes Northeast
4 5165 1947 Female 258000 850 No Yes Northeast
5 8277 1947 Male 92000 839. Yes No West
6 8525 1947 Female 170000 723. Yes Yes West
7 9437 1947 Female 95000 850 No Yes West
8 10069 1947 Other 84000 743. Yes Yes West
9 4292 1948 Female 169000 760. Yes No Northeast
10 5282 1948 Male 176000 705. Yes No South
# ℹ 10,521 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>
Sometimes we only care to keep certain variables, especially when working with a large dataset.
# A tibble: 10,531 × 6
region state star_rating review_time review_title review_text
<chr> <chr> <dbl> <chr> <chr> <chr>
1 South DC 4 06 11, 2015 Four Stars everything…
2 West WA NA <NA> <NA> <NA>
3 South AR NA <NA> <NA> <NA>
4 Midwest MN NA <NA> <NA> <NA>
5 West HI 5 03 25, 2008 Great Product!! I looked a…
6 Midwest MN 2 06 7, 2013 Not at all flattering on… I ordered …
7 Midwest MN NA <NA> <NA> <NA>
8 South KY 5 04 20, 2016 Super comfortable mini p… Super comf…
9 West NM NA <NA> <NA> <NA>
10 Northeast VT 5 10 18, 2015 nice strong enough Yeah
# ℹ 10,521 more rows
There are a number of helper functions you can use with select(), including starts_with(), contains(), matches(), num_range(), all_of(), any_of(), and everything().
We can also recode existing variables or create new variables.
# A tibble: 10,531 × 13
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1001 1971 Female 73 742. No No South
2 1002 1970 Female 31 749. Yes No West
3 1003 1988 Male 35 542. No No South
4 1004 1984 Other 64 574. Yes Yes Midwest
5 1005 1987 Male 58 644. No Yes West
6 1006 1994 Male 164 554. Yes Yes Midwest
7 1007 1968 Male 39 608. No No Midwest
8 1008 1994 Male 69 710. No No South
9 1009 1958 Male 233 702. No No West
10 1010 1994 Female 77 605. Yes No Northeast
# ℹ 10,521 more rows
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>
Two important things to remember:
_. For example: good_name2 and 2bad name. Bad names should be renamed (but can be referenced if surrounded with ``).A common variable (e.g., an ID) allows us to join two data frames.
Rows: 10531 Columns: 169
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (169): customer_id, jan_2005, feb_2005, mar_2005, apr_2005, may_2005, ju...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 10,531
Columns: 169
$ customer_id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010…
$ jan_2005 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0…
$ feb_2005 <dbl> 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 5, 0, 4, 0…
$ mar_2005 <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 5, 0…
$ apr_2005 <dbl> 4, 0, 0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 1, 0, 0…
$ may_2005 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ jun_2005 <dbl> 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0…
$ jul_2005 <dbl> 0, 0, 0, 0, 0, 0, 0, 2, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ aug_2005 <dbl> 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 0…
$ sep_2005 <dbl> 0, 0, 0, 0, 0, 4, 0, 5, 4, 0, 0, 0, 0, 2, 0, 5, 0, 0, 0, 0…
$ oct_2005 <dbl> 4, 0, 0, 0, 3, 0, 5, 2, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0, 4…
$ nov_2005 <dbl> 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 5, 3, 0, 2, 2, 0, 2, 0, 5, 0…
$ dec_2005 <dbl> 3, 0, 1, 0, 0, 2, 1, 2, 5, 0, 0, 5, 0, 2, 4, 0, 2, 2, 3, 0…
$ jan_2006 <dbl> 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 3, 0, 4, 2, 3, 0, 0, 4, 4…
$ feb_2006 <dbl> 2, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5, 0, 4, 0, 0, 0, 0, 0, 0, 0…
$ mar_2006 <dbl> 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ apr_2006 <dbl> 0, 0, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 2, 5, 3…
$ may_2006 <dbl> 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0…
$ jun_2006 <dbl> 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 2, 4, 0, 5…
$ jul_2006 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 4…
$ aug_2006 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ sep_2006 <dbl> 0, 0, 0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0…
$ oct_2006 <dbl> 0, 2, 1, 3, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ nov_2006 <dbl> 0, 0, 0, 4, 1, 1, 0, 0, 0, 3, 3, 3, 5, 1, 0, 0, 0, 0, 0, 0…
$ dec_2006 <dbl> 5, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 3, 5, 3, 0, 0, 0, 0, 3…
$ jan_2007 <dbl> 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 0, 0, 4, 4, 0, 0, 0, 0, 4, 3…
$ feb_2007 <dbl> 2, 0, 0, 0, 2, 0, 0, 3, 4, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0…
$ mar_2007 <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 3, 0…
$ apr_2007 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 1…
$ may_2007 <dbl> 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ jun_2007 <dbl> 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1, 3, 1, 5…
$ jul_2007 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 0, 0, 2, 0, 0, 0, 0, 2…
$ aug_2007 <dbl> 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ sep_2007 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 4, 3…
$ oct_2007 <dbl> 3, 3, 0, 0, 0, 3, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 0, 5, 0, 0…
$ nov_2007 <dbl> 0, 2, 0, 5, 0, 5, 0, 0, 0, 5, 1, 0, 4, 0, 0, 2, 2, 3, 0, 3…
$ dec_2007 <dbl> 0, 4, 0, 0, 4, 0, 0, 0, 0, 0, 0, 4, 0, 1, 4, 0, 4, 0, 3, 0…
$ jan_2008 <dbl> 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 2, 0, 0, 4, 0, 0, 0, 0…
$ feb_2008 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ mar_2008 <dbl> 5, 5, 0, 0, 0, 0, 3, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ apr_2008 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 4, 0, 0, 1, 1…
$ may_2008 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0…
$ jun_2008 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0…
$ jul_2008 <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 1, 2, 0, 0, 4, 0, 2, 4, 0, 2, 0…
$ aug_2008 <dbl> 0, 0, 0, 0, 0, 0, 5, 3, 0, 0, 5, 0, 2, 0, 4, 0, 3, 0, 1, 0…
$ sep_2008 <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4, 0, 0, 5, 0, 0, 0, 2…
$ oct_2008 <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 5, 4, 2, 0, 3, 0, 0, 3, 3, 0, 0, 4…
$ nov_2008 <dbl> 1, 0, 0, 2, 2, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0, 4, 0, 2, 4, 0…
$ dec_2008 <dbl> 0, 0, 0, 0, 2, 3, 1, 0, 0, 0, 2, 2, 0, 4, 4, 0, 1, 0, 5, 1…
$ jan_2009 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ feb_2009 <dbl> 2, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0…
$ mar_2009 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0…
$ apr_2009 <dbl> 3, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 5…
$ may_2009 <dbl> 0, 0, 0, 0, 2, 4, 0, 4, 0, 0, 0, 0, 0, 5, 4, 2, 0, 2, 1, 0…
$ jun_2009 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ jul_2009 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 3, 0, 1, 5, 3…
$ aug_2009 <dbl> 0, 4, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3…
$ sep_2009 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 0, 0…
$ oct_2009 <dbl> 5, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0…
$ nov_2009 <dbl> 5, 0, 0, 0, 0, 1, 0, 2, 2, 0, 4, 2, 0, 3, 4, 0, 4, 1, 2, 4…
$ dec_2009 <dbl> 0, 0, 0, 3, 3, 0, 0, 4, 1, 4, 2, 2, 0, 0, 3, 0, 0, 0, 5, 1…
$ jan_2010 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 5…
$ feb_2010 <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0…
$ mar_2010 <dbl> 4, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
$ apr_2010 <dbl> 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 3, 0, 2, 0…
$ may_2010 <dbl> 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0…
$ jun_2010 <dbl> 4, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 3…
$ jul_2010 <dbl> 0, 0, 4, 0, 0, 0, 5, 5, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4, 0…
$ aug_2010 <dbl> 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 4, 3, 0, 0, 0, 0, 0, 0…
$ sep_2010 <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0, 3…
$ oct_2010 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 4, 0, 0, 0, 0, 2, 0…
$ nov_2010 <dbl> 0, 0, 1, 2, 3, 5, 4, 1, 0, 4, 0, 3, 5, 0, 0, 0, 0, 3, 3, 0…
$ dec_2010 <dbl> 0, 0, 5, 3, 0, 0, 5, 0, 0, 0, 0, 0, 4, 4, 0, 5, 3, 0, 0, 5…
$ jan_2011 <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 5, 2, 1…
$ feb_2011 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0…
$ mar_2011 <dbl> 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 4…
$ apr_2011 <dbl> 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 3, 0…
$ may_2011 <dbl> 0, 0, 0, 0, 0, 0, 0, 2, 0, 4, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4…
$ jun_2011 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0…
$ jul_2011 <dbl> 4, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0…
$ aug_2011 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1, 2…
$ sep_2011 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ oct_2011 <dbl> 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ nov_2011 <dbl> 5, 0, 0, 1, 0, 0, 5, 1, 2, 0, 0, 0, 4, 0, 5, 0, 0, 2, 3, 2…
$ dec_2011 <dbl> 1, 4, 0, 4, 1, 5, 0, 4, 0, 0, 2, 0, 0, 0, 0, 2, 2, 4, 3, 4…
$ jan_2012 <dbl> 3, 0, 0, 0, 0, 3, 0, 0, 4, 4, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0…
$ feb_2012 <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 3, 0, 0, 5, 1, 0, 0, 0, 0…
$ mar_2012 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0…
$ apr_2012 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
$ may_2012 <dbl> 4, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 2…
$ jun_2012 <dbl> 0, 0, 0, 0, 2, 0, 0, 0, 3, 3, 0, 0, 0, 0, 3, 0, 5, 0, 2, 0…
$ jul_2012 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0…
$ aug_2012 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 2, 3, 0, 0, 0, 0, 1, 0, 0, 4, 2, 0…
$ sep_2012 <dbl> 0, 5, 1, 4, 0, 0, 0, 2, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ oct_2012 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4, 0, 0, 0, 0, 0, 0…
$ nov_2012 <dbl> 3, 2, 0, 0, 0, 0, 0, 5, 4, 4, 2, 0, 0, 5, 0, 2, 3, 0, 1, 5…
$ dec_2012 <dbl> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0, 4, 4, 1, 4, 0, 3, 4, 0…
$ jan_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 4, 4…
$ feb_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0…
$ mar_2013 <dbl> 0, 4, 0, 3, 4, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0…
$ apr_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0…
$ may_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ jun_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 3, 5, 0, 0, 0…
$ jul_2013 <dbl> 0, 0, 4, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 4, 0…
$ aug_2013 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ sep_2013 <dbl> 1, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ oct_2013 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ nov_2013 <dbl> 0, 0, 0, 0, 4, 0, 2, 1, 4, 2, 3, 0, 0, 3, 0, 0, 2, 0, 3, 3…
$ dec_2013 <dbl> 0, 4, 4, 0, 4, 2, 0, 0, 4, 0, 0, 3, 2, 3, 3, 2, 0, 3, 2, 2…
$ jan_2014 <dbl> 3, 0, 1, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0…
$ feb_2014 <dbl> 0, 2, 0, 0, 0, 1, 0, 4, 0, 5, 0, 0, 5, 1, 0, 0, 0, 3, 1, 0…
$ mar_2014 <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2…
$ apr_2014 <dbl> 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0…
$ may_2014 <dbl> 0, 3, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 3, 2, 0, 0, 0, 0, 0, 2…
$ jun_2014 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 3, 0…
$ jul_2014 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 4, 0, 0…
$ aug_2014 <dbl> 2, 3, 0, 0, 0, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
$ sep_2014 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 2, 5, 0, 1, 4, 0…
$ oct_2014 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 3, 0, 0, 0, 0, 2…
$ nov_2014 <dbl> 4, 2, 3, 4, 0, 3, 3, 0, 0, 0, 0, 0, 4, 4, 0, 4, 0, 0, 1, 3…
$ dec_2014 <dbl> 3, 0, 4, 0, 0, 1, 0, 0, 0, 3, 4, 0, 3, 0, 0, 0, 0, 2, 0, 5…
$ jan_2015 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ feb_2015 <dbl> 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 4, 0, 0, 0, 0, 3, 0, 0, 0, 0…
$ mar_2015 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0…
$ apr_2015 <dbl> 0, 0, 0, 0, 3, 2, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
$ may_2015 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 4…
$ jun_2015 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 5, 4, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0…
$ jul_2015 <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 3, 4, 5…
$ aug_2015 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 5…
$ sep_2015 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 3…
$ oct_2015 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 5, 0, 0, 0, 0…
$ nov_2015 <dbl> 3, 0, 0, 0, 1, 4, 0, 3, 4, 3, 0, 0, 1, 4, 0, 0, 0, 0, 4, 4…
$ dec_2015 <dbl> 0, 0, 0, 1, 0, 0, 0, 5, 0, 4, 0, 0, 0, 3, 3, 0, 0, 0, 3, 3…
$ jan_2016 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 3…
$ feb_2016 <dbl> 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ mar_2016 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 1, 0, 0, 0, 3, 0…
$ apr_2016 <dbl> 3, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2…
$ may_2016 <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3…
$ jun_2016 <dbl> 0, 0, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0…
$ jul_2016 <dbl> 4, 0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 0…
$ aug_2016 <dbl> 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 3, 0…
$ sep_2016 <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 3, 0…
$ oct_2016 <dbl> 3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 5, 0, 0, 0, 4, 0, 0, 4, 1, 0…
$ nov_2016 <dbl> 5, 0, 0, 4, 2, 3, 3, 0, 0, 3, 0, 0, 0, 4, 0, 2, 0, 2, 0, 0…
$ dec_2016 <dbl> 2, 1, 0, 0, 0, 0, 1, 5, 0, 0, 0, 1, 3, 0, 4, 2, 2, 0, 2, 0…
$ jan_2017 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 4, 0…
$ feb_2017 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0…
$ mar_2017 <dbl> 0, 1, 0, 0, 0, 1, 0, 2, 3, 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 4…
$ apr_2017 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0…
$ may_2017 <dbl> 0, 0, 0, 5, 0, 0, 0, 0, 0, 1, 2, 0, 0, 2, 0, 3, 3, 0, 4, 3…
$ jun_2017 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 2, 0, 0…
$ jul_2017 <dbl> 3, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 3, 3, 0, 0, 0, 2…
$ aug_2017 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 0, 1, 1, 0, 0…
$ sep_2017 <dbl> 2, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 5, 4, 0, 0, 0, 0, 0…
$ oct_2017 <dbl> 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0…
$ nov_2017 <dbl> 0, 2, 5, 0, 0, 1, 4, 0, 2, 0, 5, 0, 2, 5, 0, 0, 3, 0, 4, 0…
$ dec_2017 <dbl> 0, 0, 4, 0, 0, 0, 0, 5, 5, 5, 4, 4, 0, 0, 2, 4, 0, 3, 0, 2…
$ jan_2018 <dbl> 1, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2, 0, 0, 0, 0, 3, 1…
$ feb_2018 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ mar_2018 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 2, 0…
$ apr_2018 <dbl> 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 1, 0, 0, 0, 2, 1, 4, 0, 0…
$ may_2018 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ jun_2018 <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 2, 0, 0, 0, 0, 3…
$ jul_2018 <dbl> 0, 0, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 2, 0, 0, 0, 5…
$ aug_2018 <dbl> 0, 0, 0, 0, 0, 0, 4, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 0, 5…
$ sep_2018 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 5, 0, 2, 0, 0, 0, 0, 2, 4, 0…
$ oct_2018 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ nov_2018 <dbl> 0, 0, 4, 0, 3, 0, 0, 2, 0, 4, 1, 0, 1, 0, 0, 0, 0, 0, 5, 0…
$ dec_2018 <dbl> 0, 5, 0, 0, 2, 0, 0, 4, 0, 0, 0, 4, 0, 4, 1, 0, 0, 0, 2, 0…
A left join keeps all rows in the “left” data frame plus all columns from both data frames. A left join is a good “default” join to start with.
# A tibble: 10,531 × 181
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1001 1971 Female 73000 742. No No South
2 1002 1970 Female 31000 749. Yes No West
3 1003 1988 Male 35000 542. No No South
4 1004 1984 Other 64000 574. Yes Yes Midwest
5 1005 1987 Male 58000 644. No Yes West
6 1006 1994 Male 164000 554. Yes Yes Midwest
7 1007 1968 Male 39000 608. No No Midwest
8 1008 1994 Male 69000 710. No No South
9 1009 1958 Male 233000 702. No No West
10 1010 1994 Female 77000 605. Yes No Northeast
# ℹ 10,521 more rows
# ℹ 173 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>, jan_2005 <dbl>, feb_2005 <dbl>,
# mar_2005 <dbl>, apr_2005 <dbl>, may_2005 <dbl>, jun_2005 <dbl>,
# jul_2005 <dbl>, aug_2005 <dbl>, sep_2005 <dbl>, oct_2005 <dbl>,
# nov_2005 <dbl>, dec_2005 <dbl>, jan_2006 <dbl>, feb_2006 <dbl>,
# mar_2006 <dbl>, apr_2006 <dbl>, may_2006 <dbl>, jun_2006 <dbl>, …
An inner join keeps rows that have matching IDs along with all columns from both data frames.
# A tibble: 10,531 × 181
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1001 1971 Female 73000 742. No No South
2 1002 1970 Female 31000 749. Yes No West
3 1003 1988 Male 35000 542. No No South
4 1004 1984 Other 64000 574. Yes Yes Midwest
5 1005 1987 Male 58000 644. No Yes West
6 1006 1994 Male 164000 554. Yes Yes Midwest
7 1007 1968 Male 39000 608. No No Midwest
8 1008 1994 Male 69000 710. No No South
9 1009 1958 Male 233000 702. No No West
10 1010 1994 Female 77000 605. Yes No Northeast
# ℹ 10,521 more rows
# ℹ 173 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>, jan_2005 <dbl>, feb_2005 <dbl>,
# mar_2005 <dbl>, apr_2005 <dbl>, may_2005 <dbl>, jun_2005 <dbl>,
# jul_2005 <dbl>, aug_2005 <dbl>, sep_2005 <dbl>, oct_2005 <dbl>,
# nov_2005 <dbl>, dec_2005 <dbl>, jan_2006 <dbl>, feb_2006 <dbl>,
# mar_2006 <dbl>, apr_2006 <dbl>, may_2006 <dbl>, jun_2006 <dbl>, …
We typically want to perform many consecutive function calls. You might be tempted to do the following. Don’t do this!
crm_data_2 <- left_join(customer_data, store_transactions, by = "customer_id")
crm_data_3 <- filter(crm_data_2, region == "West", feb_2005 == max(feb_2005))
crm_data_4 <- mutate(crm_data_3, age = 2022 - birth_year)
crm_data_5 <- select(crm_data_4, age, feb_2005)
crm_data_6 <- arrange(crm_data_5, desc(age))
crm_data_7 <- slice(crm_data_6, 1)
crm_data_7# A tibble: 1 × 2
age feb_2005
<dbl> <dbl>
1 65 5
This would be another way run the same code. Don’t do this either!
Part of the common philosophy for the tidyverse is that:
This allows us to |> together functions in consecutive lines of code so that it is easier for humans to read and less error-prone.
# A tibble: 1 × 13
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1335 1978 Female 112000 850 No No West
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>
# A tibble: 1 × 13
customer_id birth_year gender income credit married college_degree region
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1335 1978 Female 112000 850 No No West
# ℹ 5 more variables: state <chr>, star_rating <dbl>, review_time <chr>,
# review_title <chr>, review_text <chr>
|> do?|>?%>%?|>.Read |> as then. (If we need to use <-, we typically put it at the beginning.)
Quarto is an open-source scientific and technical publishing system built on Pandoc.
You can write a report or presentation in Word, PowerPoint, etc.; then run code in an R script; then save output and visualizations; and then paste them in the report or the presentation…
…or you can do it all at once in Quarto.
Quarto is multilingual, used with R, Python, Julia, and Observable from RStudio, VS Code, and Jupyter.
Select the right branch first! Open up PROJECT.qmd.
The header is where (most of) the metadata lives in YAML.
---
title: "Project Name"
format: gfm
---
While R Markdown can produce HTML documents, PowerPoint slides, etc., format: gfm (i.e., GitHub-flavored markdown) is what will produce the associated .md files that GitHub reads when rendered. When you make a presentation, it will be format: revealjs (see your repo’s README for details).
## headings and ### sub-headings.functions and data in the text with `s, italics with *s, and bold with **s.Code cells are language-specific.
#| echo: fenced
transcripts <- read_rds(here::here("Data", "transcripts.rds")) |>
tibble() |>
separate(doc_id, into = c("gvkey", "call_date", "title"), sep = "_")The hashpipe operator #| allows you to specify metadata using YAML for the code cell specifically.
Divs and Spans are generic HTML/CSS elements that are used to format a document.
This is incredibly useful if you want to format an entire block of content, such as a paragraph. For example, using the .callout-warning class, I can quickly make this:
Warning
Pay attention to this information!
Using your seminar repo and week 1’s branch, modify PROJECT.qmd by drafting a paragraph-long “abstract,” a description of the kind of project you’d like to work on. A few things to consider:
Render PROJECT.qmd into PROJECT.md once you’re finished. Commit changes and push to GitHub. On GitHub, create a pull request and merge week 1’s branch into main and delete the branch when finished.
Get started on next week’s presentation by picking and reading one paper from the shared PhD seminar syllabus.